Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes crd per-version validation field path #84560

Conversation

yue9944882
Copy link
Member

@yue9944882 yue9944882 commented Oct 30, 2019

/sig api-machinery
/kind bug
/cc @sttts @liggitt @kubernetes/sig-api-machinery-misc

found this when resolving CI failures for #84005 , i dont think there will be compatiblity issues if we reword it directly?

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. kind/bug Categorizes issue or PR as related to a bug. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 30, 2019
@liggitt
Copy link
Member

liggitt commented Oct 30, 2019

@sttts can you verify this will not trigger the dueling CRD condition messages? (this is the exact scenario the generation checking code was intended to protect against, but a verification would be good)

@yue9944882 yue9944882 force-pushed the bugfix/crd-per-version-validation-field-path branch from bc0a784 to 65d2fe0 Compare October 30, 2019 15:42
@yue9944882 yue9944882 force-pushed the bugfix/crd-per-version-validation-field-path branch from 65d2fe0 to 0153473 Compare October 30, 2019 15:44
@yue9944882
Copy link
Member Author

yue9944882 commented Oct 30, 2019

currently the protection only works by checking whether the message is identical, which will not be helping the case where there are two apiservers writing different messages/conditions in HA setup:

T0: for two apiservers in a cluster, one SVR1, the other SVR2
T1: the non-structural controllers in both server watched events and triggered reconciliation
T2: SVR1 successfully updated the condition w/ TXT1 and bumped lastSeenGeneration on finish.
T3: SVR2 failed recondition due to update conflict, and will be triggered another retry soon after
T4: SVR2 reconcile again and overwrite the condition to TXT2 and bump its lastSeenGeneration.

the condition will end up twiddling from TXT1 to TXT2, it can be even worse if the replicas are multiplied...

@liggitt
Copy link
Member

liggitt commented Oct 30, 2019

This protects against updating condition more than once for the same generation:

// avoid repeated calculation for the same generation
c.lastSeenGenerationLock.Lock()
lastSeen, seenBefore := c.lastSeenGeneration[inCustomResourceDefinition.Name]
c.lastSeenGenerationLock.Unlock()
if seenBefore && inCustomResourceDefinition.Generation <= lastSeen {
return nil
}

@liggitt
Copy link
Member

liggitt commented Oct 30, 2019

condition is in status and does not modify metadata.generation

@@ -112,7 +112,7 @@ func calculateCondition(in *apiextensions.CustomResourceDefinition) *apiextensio
return cond
}

pth := field.NewPath("spec", "version").Key(v.Name).Child("schema", "openAPIV3Schema")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this an associative list? if so I think keyed by name is at least a somewhat acceptable way to index it. (version->versions seems correct though.)

But how to avoid fighting with a prior apiserver during an upgrade?

I'm not a fan of our current mechanism at all, it seems super brittle. Changes like this one shouldn't be so fraught with danger :/

Copy link
Member

@liggitt liggitt Oct 30, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field path is intended to output the jsonpath to the field, so this should be an index

But how to avoid fighting with a prior apiserver during an upgrade?

we already protected against that, I just wanted a sanity check the mechanism was working

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field path is intended to output the jsonpath to the field, so this should be an index

Sounds like something that could (and therefore should) be tested...

@yue9944882
Copy link
Member Author

This protects against updating condition more than once for the same generation:

the protection only works within one apiserver. in an HA setup where the replicas are partially upgraded, e.g. M replicas updating old message and N replicas updating new message, the condition message can be flipping at worst min(M, N) times, isn't it? @liggitt

@liggitt
Copy link
Member

liggitt commented Nov 4, 2019

in an HA setup where the replicas are partially upgraded, e.g. M replicas updating old message and N replicas updating new message, the condition message can be flipping at worst min(M, N) times, isn't it? @liggitt

yes, though the normal HA replicas upgrade scenario which starts with M replicas in steady-state, then does a rolling upgrade to N replicas of the new version would only update the message a single time.

@sttts sttts added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Nov 6, 2019
@k8s-ci-robot k8s-ci-robot removed the needs-priority Indicates a PR lacks a `priority/foo` label and requires one. label Nov 6, 2019
@sttts
Copy link
Contributor

sttts commented Nov 7, 2019

@liggitt Why should index based errors cause more flapping than name based? I remember that we talked about flapping in the review, but I don't remember the reasoning for the current code.

Or is it just this change which you fear will cause flapping? Sure it will, for a 3 node HA cluster this will cause max 6 updates of the condition, per CRD. That does not sound like a big deal IMO.

@liggitt
Copy link
Member

liggitt commented Nov 7, 2019

@liggitt Why should index based errors cause more flapping than name based? I remember that we talked about flapping in the review, but I don't remember the reasoning for the current code.

It's not specific to index-based errors, this is just the first message change I'm aware of where we have old apiservers in play, and I wanted a double-check of our flapping mitigation.

Or is it just this change which you fear will cause flapping? Sure it will, for a 3 node HA cluster this will cause max 6 updates of the condition, per CRD. That does not sound like a big deal IMO.

Agree.

@yue9944882
Copy link
Member Author

any more action item required in this pull?

@liggitt
Copy link
Member

liggitt commented Nov 7, 2019

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 7, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liggitt, yue9944882

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 7, 2019
@k8s-ci-robot k8s-ci-robot merged commit 9c5952e into kubernetes:master Nov 7, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Nov 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note-none Denotes a PR that doesn't merit a release note. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants